5 DMA COPROCESSOR
5.1 DESIGN OF THE DMA COPROCESSOR
COPROCESSOR PHILOSOPHY
A coprocessor is a slave CPU which computes in
parallel with the master CPU.
Floating point operations are often handled by a coprocessor to free up
the master CPU for other work. Many
applications require high speed I/O with minimum CPU overhead. Special DMA, video display, and network
control chips are usually used.
These chips add expense and complexity to the application. ShBOOM provides a coprocessor to dispatch
input/output with the off-chip world with minimum interaction from the master
CPU. This unit is called the DMA
Coprocessor.
On the same die with the ShBOOM CPU, the DMA
Coprocessor acts as a loosely coupled slave processor. The coprocessor shares memory space with
the CPU but has its own program counter and instruction set. Contrary to the conventional coprocessor
design, the DMA Coprocessor always has bus priority so it can dispatch
input/output in a time predictable manner with minimum master CPU
overhead.
Coprocessor I/O devices are memory mapped in the ShBOOM memory space.
Simple decoding of the high order address bits output by the Transfer
Address Register (see below) selects a specific device.
THE DMA COPROCESSOR
The DMA Coprocessor is a self-contained CPU which fetches and executes a
special instruction set from memory shared with the ShBOOM CPU. The
instructions move data between the specified I/O devices and ShBOOM dynamic
RAM. The coprocessor executes instructions continuously and may run without
interaction from the ShBOOM CPU.
The coprocessor has a 20-bit program counter, and the coprocessor memory
space overlaps the ShBOOM CPU space starting at location zero. An important
feature of the coprocessor is that its input/output transfers are
predictable in time. The coprocessor takes precedence over any other
activity (interrupt, master CPU data fetch, or instruction prefetch) which
might request the system bus.
The coprocessor is therefore very useful for generating
and interpreting video timing signals, disk drive interfacing, and network
signals which must happen at precise intervals. In a ShBOOM system, the coprocessor can
perform tasks including DRAM refresh, high quality sound generation, MODEM
signal generation and reception, or multiprocessor communications.
5.2 DMA COPROCESSOR REGISTERS
Figure 5.1 shows the coprocessor block diagram and
registers. Requests for the bus
from the ShBOOM CPU must pass through the coprocessor. This allows the coprocessor to maintain
the highest priority for bus access.
CPC (COPROCESSOR PROGRAM COUNTER)
The 20-bit Coprocessor Program Counter points to instructions which reside
in the first 1 megabyte of ShBOOM memory. (Since the CPC counts in
increments of 32-bit instructions, the two least significant bits of the
CPC are always zero.) A coprocessor program may be as short as a single
32-bit instruction (a JUMP to itself) or as long as the entire 1 megabyte
coprocessor instruction space.
The CPC is incremented by four after execution of
each instruction and can be loaded by a JUMP instruction or by a write
operation from the ShBOOM CPU.
FLAG
FLAG is a single bit register which can be polled by the ShBOOM CPU to
determine when a specific coprocessor instruction has executed. FLAG is set
or reset by the JUMP instruction.
TIR (TRANSFER INTERVAL REGISTER)
The TIR is a 12-bit register which counts the number of system clock pulses
between timed input/output transfers. For example, in a video display, while
displaying a scan line the TIR would hold the time delay corresponding to
the pixel rate divided by 32 bits.
The TIR is a write only register and can be loaded by a LOAD or MOVE
instruction. It is decremented by each system clock.
The TIR cannot be read or written by the ShBOOM CPU.
TSC (TRANSFER SIZE COUNTER)
The TSC is a 12-bit register which counts the number
of input/output transfers scheduled by a LOAD or MOVE instruction. It is decremented each time an
input/output transfer is performed.
The TSC is write only. The TSC cannot be read or written by the ShBOOM
CPU.
TAR (TRANSFER ADDRESS REGISTER)
The TAR is a 32-bit register which points to the I/O
memory source or destination scheduled by a LOAD or MOVE instruction. It is incremented each time an
input/output transfer is performed.
The TAR is write only. The TAR cannot be read
or written by the ShBOOM CPU.
5.3 DMA COPROCESSOR INSTRUCTIONS
The coprocessor fetches and executes the following
32-bit instructions:
LOAD
MOVE-RAM-I/O
MOVE-I/O-RAM
REFRESH
READ&PACK
TEST-WAIT-WRITE
TEST-WAIT-READ
TEST-WRITE
TEST-READ
JUMP
Bits of the Binary Opcodes are interpreted as follows:
R - Refers to a bit destined for the Transfer Address Register
S - Refers to a bit destined for the Transfer Size Counter
X - A bit which sets or resets the FLAG register
Y - A bit which is ORed with its corresponding I/O Register bit and sent to
    a corresponding output latch
N - Causes an interrupt if set to "1" in a JUMP
P - Refers to a bit destined for the Coprocessor Program Counter
I - Refers to a bit destined for the Transfer Interval Register
LOAD
Binary Opcode RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10
The LOAD instruction loads the value specified by
(RRRR...) into the Transfer Address Register.
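As a rough illustration, a LOAD word can be assembled by placing the 30 R bits above the two tag bits "10". The function name `encode_load` and the choice to pass the field as a raw 30-bit integer are assumptions of this sketch; the text does not state how the 30 R bits map onto the 32-bit Transfer Address Register.

```python
def encode_load(r_field: int) -> int:
    """Build a 32-bit LOAD word: 30 R bits followed by the tag bits 10.

    r_field is the raw 30-bit (RRRR...) value (hypothetical parameter).
    """
    if not 0 <= r_field < (1 << 30):
        raise ValueError("R field must fit in 30 bits")
    return (r_field << 2) | 0b10


# Show the bit pattern for an arbitrary example value.
word = encode_load(0x1234)
print(f"{word:032b}")
```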
MOVE-RAM-I/O
Binary Opcode SSSS SSSS SSSS IIII IIII IIII 0110 0000
This instruction performs the following functions:
Reads the number of 32-bit words specified by (SSSS...) from the memory
location pointed to by the Transfer Address Register,
Waits the interval of system cycles specified by (IIII...) between
transfers,
Writes the number of 32-bit words specified by (SSSS...) to the selected
I/O device.
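The field layout of MOVE-RAM-I/O can be sketched as a small encoder: 12 bits of transfer size, 12 bits of interval, and the low byte 0110 0000. The names `encode_timed` and `MOVE_RAM_IO` are hypothetical, and the sketch assumes the printed pattern reads from bit 31 on the left to bit 0 on the right.

```python
MOVE_RAM_IO = 0b0110_0000  # low byte of the MOVE-RAM-I/O opcode pattern


def encode_timed(size: int, interval: int, low_byte: int) -> int:
    """Pack S (12 bits), I (12 bits) and an 8-bit tag into one 32-bit word."""
    fields = (("size", size, 12), ("interval", interval, 12),
              ("low_byte", low_byte, 8))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} must fit in {bits} bits")
    return (size << 20) | (interval << 8) | low_byte


# Example values only: 20 words, 8 system cycles between transfers.
word = encode_timed(size=20, interval=8, low_byte=MOVE_RAM_IO)
print(f"{word:08X}")
```

Both counters are 12 bits wide, so neither field can exceed 4095 in a single instruction.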
MOVE-I/O-RAM
Binary Opcode SSSS SSSS SSSS IIII IIII IIII 0100 0000
This instruction performs the following functions:
Reads the number of 32-bit words specified by (SSSS...) from the selected
I/O device,
Waits the interval of system cycles specified by (IIII...) between
transfers,
Writes the number of 32-bit words specified by (SSSS...) to the memory
location pointed to by the TAR.
REFRESH
Binary Opcode SSSS SSSS SSSS IIII IIII IIII 0110 1100
This instruction does the following:
Performs the number of CAS-before-RAS refresh cycles specified by
(SSSS...),
Waits the interval of system cycles specified by (IIII...) between refresh
cycles.
READ&PACK
Binary Opcode SSSS SSSS SSSS IIII IIII IIII 0000 0000
This instruction does the following:
Reads an 8-bit byte applied to the low order data lines of the selected I/O
device,
Waits the interval of system cycles specified by (IIII...),
Repeats the process three more times, shifting each byte eight places to
build a 32-bit value,
Writes the 32-bit value to the location specified by the Transfer Address
Register,
Repeats the process the number of times specified by (SSSS...).
READ&PACK is performed automatically upon
power-on reset to load the bootstrap program from a byte-wide PROM. The instruction is also useful for
interfacing with 8-bit peripherals such as HDTV A/D converters.
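The packing step of READ&PACK can be modelled in a few lines. This is only a sketch of the data path: the function name `pack_bytes` is hypothetical, and the byte order (first byte read ends up most significant) is an assumption, since the text says only that each byte is shifted eight places.

```python
def pack_bytes(stream: bytes) -> list:
    """Pack each group of four bytes into one 32-bit word.

    Assumption: the first byte read becomes the most significant byte.
    """
    if len(stream) % 4:
        raise ValueError("stream length must be a multiple of four bytes")
    words = []
    for start in range(0, len(stream), 4):
        value = 0
        for byte in stream[start:start + 4]:
            # Shift the partial word eight places and merge the new byte.
            value = ((value << 8) | byte) & 0xFFFFFFFF
        words.append(value)
    return words


print([f"{w:08X}" for w in pack_bytes(bytes([0x12, 0x34, 0x56, 0x78]))])
```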
TEST-WAIT-WRITE
Binary Opcode RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10
TEST-WAIT-WRITE tests the specified input line for a logic "0" level. If the
line is a logic "1", coprocessor execution halts until the line becomes a
"0".
When the line is pulled low to "0", the specified input port is read and the
data is written to the address specified by (RRRR...) in the instruction.
The address specifier (RRRR...) bits of the instruction are incremented in
memory. If the least significant 12 bits of the address are zero, the
instruction will act as a NOP when next executed.
TEST-WAIT-READ
Binary Opcode RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10
TEST-WAIT-READ tests the specified input line for a logic "0" level. If the
line is a logic "1", coprocessor execution halts until the line becomes a
"0".
When the line is pulled low to "0", the data at the address specified by
(RRRR...) is read and written to the specified output port. The address
specifier (RRRR...) bits of the instruction are incremented in memory. If
the least significant 12 bits of the address are zero, the instruction will
act as a NOP when next executed.
TEST-WRITE
Binary Opcode RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10
TEST-WRITE tests the specified input line for a logic "0" level. If the line
is a logic "1", the instruction acts as a NOP.
If the line is pulled low to "0", the specified input port is read and the
data is written to the address specified by (RRRR...) in the instruction.
The address specifier (RRRR...) bits of the instruction are incremented in
memory. If the least significant 12 bits of the address are zero, the
instruction will act as a NOP when next executed.
TEST-READ
Binary Opcode RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10
TEST-READ tests the specified input line for a logic "0" level. If the line
is a logic "1", the instruction acts as a NOP.
If the line is pulled low to "0", the data at the address specified by
(RRRR...) is read and written to the specified output port. The address
specifier (RRRR...) bits of the instruction are incremented in memory. If
the least significant 12 bits of the address are zero, the instruction will
act as a NOP when next executed.
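Because the TEST instructions increment their own address field and turn into a NOP once the low 12 bits reach zero, the starting address effectively sets the buffer length. The sketch below counts the transfers left before that happens. It assumes the field advances by one per transfer, which the text does not state, and the name `transfers_before_nop` is hypothetical.

```python
def transfers_before_nop(address_field: int) -> int:
    """Count transfers until the low 12 bits of the address wrap to zero.

    Assumption: the address field is incremented by one after each transfer.
    A field whose low 12 bits are already zero acts as a NOP immediately.
    """
    low12 = address_field & 0xFFF
    return 0 if low12 == 0 else 0x1000 - low12


# A buffer that should accept 256 transfers starts 256 below a 4096 boundary.
print(transfers_before_nop(0x2F00))
```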
JUMP
Binary Opcode 0000 0000 NYYX PPPP PPPP PPPP PPPP PP11
JUMP loads the Coprocessor Program Counter with the specified 18-bit value
(PPPP...). The next coprocessor instruction will be fetched from the
specified word address. The Coprocessor Flag Register is cleared to "0" if
X=0 and set to "1" otherwise. FLAG signals to the ShBOOM CPU that a JUMP has
been executed.
For example, at the end of a video scan line, setting
FLAG can inform the CPU to set up the next line.
If N = "1", the ShBOOM CPU will be
interrupted when JUMP executes. The
CPU will push the current Program Counter onto the Return Stack and set the
Program Counter to "0" and continue executing.
YY is ORed with bits 0 and 1 from the I/O register
and then output to the I/O latch.
In this way, execution of the JUMP instruction can make one or both
external lines go high.
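The JUMP fields can be assembled as follows. This is a sketch under the assumption that the printed pattern reads from bit 31 on the left to bit 0 on the right, which places N at bit 23, YY at bits 22-21, X at bit 20, and the 18-bit word address at bits 19-2. The function and parameter names are hypothetical.

```python
def encode_jump(word_address: int, flag: int = 0,
                interrupt: int = 0, yy: int = 0) -> int:
    """Build a 32-bit JUMP word from its N, YY, X and P fields."""
    if not 0 <= word_address < (1 << 18):
        raise ValueError("word address must fit in 18 bits")
    if flag not in (0, 1) or interrupt not in (0, 1):
        raise ValueError("flag and interrupt are single bits")
    if not 0 <= yy <= 0b11:
        raise ValueError("yy must fit in 2 bits")
    return ((interrupt << 23) | (yy << 21) | (flag << 20)
            | (word_address << 2) | 0b11)


# A one-instruction program: JUMP to word 0 (itself, if stored at address 0).
print(f"{encode_jump(0):08X}")
```

With this layout the P field shifted left by two is the byte address, which matches a 20-bit program counter whose two low bits are always zero.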
5.4 PROGRAMMING CONSIDERATIONS
DRAM REFRESH
DMA Coprocessor instructions can take thousands or
millions of cycles to execute. A dynamic
RAM must be entirely refreshed every 8 msec. The system designer must make certain
that provisions for distributed or lumped refresh have been made consistent
with the specifications of the DRAM in use.
In most I/O configurations, there are natural
opportunities for refresh that may be easily exploited by the coprocessor. In a video display, the time during
horizontal retrace is suitable. In
a disk controller, the end of a sector read or write might be convenient for a
burst refresh.
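For distributed refresh, the spacing between single-row refresh cycles follows from the 8 msec requirement. The sketch below works it out in system clock cycles. The row count (512) and clock rate (50 MHz) are hypothetical example values, not figures from this manual, and should be replaced with those of the DRAM and system in use.

```python
def refresh_interval_cycles(rows: int, clock_hz: int,
                            refresh_period_us: int = 8000) -> int:
    """System clock cycles available between single-row refreshes."""
    if rows <= 0 or clock_hz <= 0:
        raise ValueError("rows and clock_hz must be positive")
    # Integer arithmetic avoids floating point rounding surprises.
    return (clock_hz * refresh_period_us) // (rows * 1_000_000)


cycles = refresh_interval_cycles(rows=512, clock_hz=50_000_000)
fits = cycles < (1 << 12)  # does it fit a 12-bit interval field?
print(cycles, fits)
```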
BUS BANDWIDTH CONSIDERATIONS
The resource that most limits the speed of a computer system is memory bus
bandwidth. ShBOOM uses bus bandwidth very efficiently by fetching four
instructions in each memory cycle. However, the DMA Coprocessor fetches
instructions and data using the same bus.
The coprocessor always maintains priority on bus
availability so its I/O transfers can be precise. The more frequently the coprocessor
makes I/O requests, the less bandwidth is available for the ShBOOM CPU. It is possible in an extreme case for
the coprocessor to completely block all bus access.
In most applications, the coprocessor will spend very
little bus time fetching and executing its instructions, since an instruction
may take up to 16 million internal machine cycles to execute before the next
bus cycle is required.
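The 16 million figure is consistent with the two 12-bit counters: up to 4096 transfers, each separated by up to 4096 system clocks. The short check below is an order-of-magnitude sketch only, since the text does not say whether a field value of N means N or N+1 counts.

```python
TSC_BITS = 12  # Transfer Size Counter width
TIR_BITS = 12  # Transfer Interval Register width

# Largest number of transfers times largest interval between them.
max_cycles = (1 << TSC_BITS) * (1 << TIR_BITS)
print(max_cycles)
```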
For example, in a simple video application driving a 50 MHz display with a
shift register, input/output transfers will consume about 9 percent of the
bus bandwidth. After each I/O transfer, the ShBOOM CPU will probably have to
initiate a memory RAS cycle to reset the fast page mode to point to the
current program. Systems such as hard disc or network controllers will use
about 1 percent of the bus bandwidth.
INITIALIZATION
ShBOOM fetches and executes instructions from
DRAM. On power up, the first
operation is to transfer the program to be executed from PROM to the DRAM.
When power is first applied or when the RESET line is
pulled low, the following RESET sequence is performed to initialize the
internal registers, to load the program from PROM into DRAM, and to begin
execution:
1. Block execution of the ShBOOM CPU,
2. Clear the Coprocessor Program Counter to "0",
3. Clear the Transfer Address Register to "0",
4. Set the Transfer Size Counter to hex "FFF",
5. Set the Transfer Interval Register to hex "7",
6. Enable execution of the I/O Coprocessor (the coprocessor will load 12
   bytes from the external PROM into the lowest ShBOOM memory locations),
7. Set the ShBOOM CPU Program Counter to HEX byte "....0008",
8. Enable execution of the ShBOOM CPU.
The second and third 32-bit instructions loaded from
the PROM will place the coprocessor in a refresh loop. The ShBOOM CPU will begin executing the
code loaded by the coprocessor. The
CPU will usually direct the coprocessor to read in the rest of the PROM contents
using READ&PACK.
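The register values established by the RESET sequence can be collected in a small model, which may be handy as the starting state of a simulator. The class name `CoprocessorRegisters` and its field names are hypothetical; only the four values come from the numbered steps, and FLAG is left out because the sequence does not mention it.

```python
from dataclasses import dataclass


@dataclass
class CoprocessorRegisters:
    """Coprocessor register state immediately after RESET."""
    cpc: int = 0x00000  # Coprocessor Program Counter, cleared to 0
    tar: int = 0x00000000  # Transfer Address Register, cleared to 0
    tsc: int = 0xFFF  # Transfer Size Counter, set to hex FFF
    tir: int = 0x7  # Transfer Interval Register, set to hex 7


regs = CoprocessorRegisters()
print(regs)
```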
COMMUNICATIONS WITH THE ShBOOM CPU
The coprocessor can be initialized by the ShBOOM CPU
and then left to run independently.
The CPU directs the coprocessor by writing instructions into the shared
memory. The CPU can direct the
coprocessor to begin executing a different program by writing a JUMP
instruction to the new coprocessor program.
The coprocessor communicates to the ShBOOM CPU with
the FLAG bit and by a dedicated interrupt.
The coprocessor JUMP instruction can set or reset the FLAG bit when the
JUMP is taken. This bit may then be
polled by the CPU.
If the interrupt bit is set in the JUMP op-code, at
the completion of the next 4-byte instruction group, the CPU will push the
Program Counter on the Return Stack and JUMP to location "0" to
execute the I/O coprocessor interrupt routine.
TIMED DMA TRANSFERS
Transfers timed by the internal system clock allow
the creation of streams of data at precise intervals or the sampling of streams
of input at precise intervals.
A video signal is one of the most demanding timed DMA
transfers. The scan line pixels
must be output precisely to avoid display distortion.
In another example, a sine wave can be generated by outputting a
consecutive group of numbers to a Digital to Analog converter at precise
intervals. In fact, any waveform can be reproduced in this fashion.
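The group of numbers for one sine period might be prepared like this before being placed in memory for the coprocessor to output. It is a sketch only: the function name `sine_table`, the 8-bit unsigned sample format, and the table length are assumptions, since the text does not specify the converter.

```python
import math


def sine_table(samples: int, bits: int = 8) -> list:
    """One period of a sine wave as unsigned integers for a D/A converter."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    top = (1 << bits) - 1  # largest code the converter accepts
    return [round((math.sin(2 * math.pi * n / samples) + 1) / 2 * top)
            for n in range(samples)]


table = sine_table(16)
print(table)
```

The output frequency is then set by the table length and the transfer interval: one period takes the number of samples times the interval between transfers.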
In a MODEM receiver application, timed DMA input
transfers can sample an Analog to Digital converter at precise intervals and
store the signal values for digital signal processing into binary data by the
ShBOOM CPU.
The I/O Coprocessor takes precedence over all other
bus activities, so that timed transfers will always be precise.
CLOCKED DMA TRANSFERS
Transfers clocked by one of eight external clock
lines allow the transfer of streams of data upon demand of an external device.
A typical application of a clocked transfer is found
in a disc drive. When reading
a sector, the disc would switch a clock line to indicate that data was ready to
be read. The coprocessor would then
transfer the data directly into memory.
During a write operation, the drive would switch
another line indicating it was ready to receive the next data. The coprocessor would read the next data
from memory and transfer it directly to the disc.
Local Area Network interfaces are other excellent applications
for clocked DMA transfers.
CLOCKED DMA I/O VERSUS INTERRUPT DRIVEN I/O
Interrupts are often used in minicomputer and
microprocessor systems to control I/O transfers. For example, a UART interrupts a
microprocessor to indicate a byte of information is ready. The microprocessor saves its current
state and jumps to the interrupt service routine.
The routine usually reads the byte, stores it into the receive buffer, and
then restores the microprocessor's preinterrupt state. The overhead of
saving state, jumping to the routine, and then returning reduces the
efficiency of interrupt initiated transfers by a factor of ten to thirty
compared to clocked DMA.
Interrupt response time is the greatest I/O limitation in many
microprocessors. The response time of ShBOOM clocked DMA is one memory
cycle for the highest priority device (120 nsec.), and a worst case of
eight cycles for the lowest priority.
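The two end points quoted above work out to 120 nsec and 960 nsec. The sketch below fills in the channels between them on the assumption that each higher priority channel may add one memory cycle of delay; that linear scaling is a guess, as the text gives only the best and worst figures. The name `worst_case_response_ns` is hypothetical.

```python
MEMORY_CYCLE_NS = 120  # one memory cycle, from the figure quoted above
CHANNELS = 8  # clocked DMA channels


def worst_case_response_ns(priority: int) -> int:
    """Worst-case response for a channel; priority 0 is the highest.

    Assumption: every higher priority channel may be served first,
    costing one memory cycle each.
    """
    if not 0 <= priority < CHANNELS:
        raise ValueError("priority must be between 0 and 7")
    return (priority + 1) * MEMORY_CYCLE_NS


print([worst_case_response_ns(p) for p in range(CHANNELS)])
```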
The ShBOOM chip offers eight high speed clocked DMA
channels as the most efficient method of transferring data to and from
peripheral devices.
A ShBOOM interrupt waits for the current four byte instruction group to
finish executing before responding. The worst case response time would be
encountered if four multiply instructions occurred in a row and the I/O
coprocessor made heavy demands on the bus. It is thereby theoretically
possible to delay an internal interrupt by as much as 4 microseconds.
5.5 EXAMPLES OF THE DMA COPROCESSOR USAGE
Video Display
Embedded controllers are always controlling I/O,
often at high speed. The I/O
coprocessor can be programmed to automatically retrieve pixels from a screen
buffer, output the pixels to a video shift register, generate sync signals,
generate retrace, and continuously repeat until instructed differently by the
ShBOOM CPU.
Because the coprocessor has bus priority, the display
is flicker free regardless of CPU operations including interrupts.
Hard Disk Controller
The coprocessor can read data directly from a hard disk
and place it in a temporary buffer.
While the next sector is being read, the CPU can perform error checking
or decompress the data and place it in the sector buffer.
The same CPU can compress outgoing data, place it in
a temporary buffer, and notify the coprocessor. While the CPU works on compressing the
next sector, the coprocessor can position the head, wait for the correct
sector, and write out the data.
LAN or PBX Controller
As a high-speed data concentrator, the coprocessor
can receive input from eight 10 Mbit/sec lines and store the received data in
eight different temporary buffers.
The CPU can be notified when any buffer is filled. Writing out to the lines can be
similarly performed.
Scalable Font Generator
In a graphics printer application, the CPU can generate fonts on the fly,
place the pixels in a temporary buffer, and notify the coprocessor, which
outputs them to the print head or laser. As a result of computing the fonts
on the fly, a scalable font printer can be produced with one tenth the
memory required by other systems.
Universal MODEM
The coprocessor reads the output from the A/D into a
circular temporary buffer. The CPU
can perform the DSP algorithms while A/D input is continuously read. The CPU can also perform error
correcting and decompression.
Output to the D/A can be performed concurrently with input by the
coprocessor.